Problem Statement

While our ability to gather vast amounts of video data is growing at a staggering rate, our ability to effectively store, process, and
analyze this video has not kept pace. It is therefore necessary to develop automatic methods for allocating limited resources in video
understanding. In particular, it is important to reason about which portions of video require expensive analysis and storage.

We focused on three important video understanding problems. First, we examined low-level vision tasks in which moving objects are
separated from the background. Second, we made use of object recognition algorithms for locating and identifying specific objects of
interest. Finally, we looked at the problem of identifying activities that are characterized by patterns that occur across space and time,
such as repeatedly performing an action. We attacked these problems in the context of both single-camera video and video from
small-scale networks. For each of these tasks, we made use of a range of algorithms, both existing and novel, that offer trade-offs between
cost and accuracy. We developed inference algorithms that allow us to deploy cheap, noisy algorithms and then reason about which portions
of video require more expensive processing.

Summary  of  Approach 

This project aims to make these inferences using new and existing tools from Statistical Relational Learning (SRL). SRL is a
recently emerging technology that enables the effective integration of statistical or probabilistic information with relational or logical
domain information, providing the ability to reason collectively about large, complex, interacting domains. Graphical models are
used to represent the relational information captured in the statistical dependencies between the information available in different
portions of video. Methods from SRL are then used to integrate information across a large video data set and to reason about the label
acquisition problem, that is, deciding which additional information will be most valuable.
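
As a rough illustration of the label acquisition idea, the sketch below ranks frames by how uncertain a cheap detector is about them and spends a fixed budget of expensive processing on the most uncertain ones. It is only a simple entropy heuristic, not the algorithm developed in this project; the function names, the example probabilities, and the budget are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_frames(foreground_probs, budget):
    """Return the indices of the `budget` most uncertain frames, in time order.

    foreground_probs[i] is an assumed per-frame probability, produced by a
    cheap detector, that frame i contains an object of interest.
    """
    ranked = sorted(
        range(len(foreground_probs)),
        key=lambda i: binary_entropy(foreground_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Hypothetical cheap-detector outputs for five frames.
probs = [0.02, 0.48, 0.97, 0.60, 0.10]
chosen = select_frames(probs, budget=2)
print(chosen)
```

Frames with probabilities near 0.5 are chosen first because an expensive label there is expected to resolve the most uncertainty. A relational model would additionally account for how a label on one frame informs its neighbors, which this sketch ignores.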

We consider both vision algorithms and graphical models that range from the simple to the complex. First, we developed
systems that can handle background subtraction and object recognition tasks in a single video stream using Hidden Markov Models.
This relatively simple graphical model allows exact inference to be performed efficiently. Next, we developed methods for
reasoning about background subtraction and object identification tasks in more complex graphical models with known topology.
These can be applied to networks of cameras in which the relationship between images in each camera can be learned offline.
Optimal inference is generally intractable in these models, so we explored the use of approximate algorithms to allow us to
control processing in these settings. Finally, we explored the more complex problem in which the topology of the graphical model
must be learned on the fly. This is important for networks with moving cameras, or for inference in very large (e.g., gigapixel)
cameras, in which a graphical model should be constructed on the fly to represent individual objects of interest and their
relationships.
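
For the single-stream Hidden Markov Model case, exact inference can be carried out with the standard forward-backward algorithm. The sketch below is a minimal, generic version for a two-state chain (background and foreground); the state layout, the transition matrix, and the per-frame likelihoods are illustrative assumptions, not values taken from this project.

```python
def forward_backward(init, trans, likelihoods):
    """Posterior marginals P(state_t | all observations) for a discrete HMM.

    init[s]           -- prior probability of state s at the first frame
    trans[r][s]       -- probability of moving from state r to state s
    likelihoods[t][s] -- P(observation at frame t | state s)
    """
    n_states = len(init)
    n_frames = len(likelihoods)

    # Forward pass, normalized at every step to avoid numerical underflow.
    alphas = []
    prev = None
    for t in range(n_frames):
        if t == 0:
            a = [init[s] * likelihoods[0][s] for s in range(n_states)]
        else:
            a = [
                likelihoods[t][s]
                * sum(prev[r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        z = sum(a)
        prev = [x / z for x in a]
        alphas.append(prev)

    # Backward pass, also normalized.
    betas = [[1.0] * n_states for _ in range(n_frames)]
    for t in range(n_frames - 2, -1, -1):
        b = [
            sum(
                trans[s][r] * likelihoods[t + 1][r] * betas[t + 1][r]
                for r in range(n_states)
            )
            for s in range(n_states)
        ]
        z = sum(b)
        betas[t] = [x / z for x in b]

    # Combine both passes and renormalize to obtain the smoothed posteriors.
    posteriors = []
    for a, b in zip(alphas, betas):
        g = [x * y for x, y in zip(a, b)]
        z = sum(g)
        posteriors.append([x / z for x in g])
    return posteriors


# Hypothetical example: state 0 = background, state 1 = foreground.
init = [0.9, 0.1]
trans = [[0.95, 0.05], [0.10, 0.90]]
likelihoods = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]
posteriors = forward_backward(init, trans, likelihoods)
for t, p in enumerate(posteriors):
    print(t, round(p[1], 3))
```

The cost is linear in the number of frames, which is what makes the chain-structured model attractive; in graphical models with loops this exact computation is generally unavailable and approximate inference is needed instead.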

Scientific  Barriers 

Coping with large amounts of video data presents two fundamental problems. The first concerns computational resources. Modern
cameras and networks of cameras can generate such huge volumes of data that it is not possible to apply the most expensive,
state-of-the-art algorithms to every frame of video. This is particularly true when we need to evaluate new queries (e.g., “Did a blue Ford
pickup and a yellow Corolla ever drive along the same path, stopping at the same parking lot at different times?”) on hours or days
of video that have been previously collected. In such cases even real-time processing is not adequate, and it is essential to develop
methods that control processing, directing it to the portions of video that are most relevant to the task at hand. The second problem is
that different portions of large video collections may be related in complex ways. Two distinct portions of video may be related
because they show the same object, or related actions. Objects may pass from one camera to another with some time delay, which
may vary depending on traffic patterns, which themselves vary over time. Our research therefore focuses on developing methods to learn
appropriate models of these relationships, and on using these to perform inference and to reason about resource allocation.

Significance 

Video surveillance is critical in many military settings. Current systems collect huge amounts of video data, which must be annotated
and evaluated by human operators or analysts. Not only is this manual analysis costly, but constrained resources significantly limit
the amount of analysis that can be performed. It is also currently very difficult to mine video collections to find unusual or telling
patterns (e.g., “Is there a location associated with a significant number of cars that turn around when approaching an unanticipated
roadblock?”). Improved methods for automatic video analysis will greatly amplify our ability to process and mine this video.

Summary of Most Important Results

Completed algorithms and experiments for motion detection and object recognition in a single video.

• The algorithm selects the frames to which expensive detection algorithms are applied, and provably approximates the optimal
choices very efficiently.

• Our new algorithms significantly outperform all baseline algorithms, producing more accurate results with much less
processing required.

• Paper published in NIPS Workshop on Adaptive Sensing, Active Learning and Experimental Design: Theory, Methods
and Applications, 2009.

•  Paper  accepted  for  publication  by  the  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence. 

Developed the Reflect and Correct algorithm for complex graphical models. It directs human attention to classifying a small number
of objects that provide the most information about other, related objects.

•  New  method  outperforms  several  state-of-the-art  algorithms. 

• Paper published at the International Conference on Knowledge Discovery and Data Mining, 2009.

•  Best  student  paper  award. 

•  Paper  published  at  the  National  Conference  on  Artificial  Intelligence  (AAAI  NECTAR  Track),  2010. 

• Paper published in the ACM Transactions on Knowledge Discovery from Data, Volume 3, Number 4, pages 1-32,
November 2009.

Developed new algorithm for active learning in network data.

•  Paper  published  in  International  Conference  on  Machine  Learning,  2010. 

• Paper on “Query-driven Active Surveying for Collective Classification,” with Galileo Mark Namata, Ben London,
Lise Getoor, and Bert Huang, in ICML Workshop on Mining and Learning with Graphs, June 2012.

Developed  new  algorithm  for  enforcing  visibility  constraints  in  object  recognition  from  multiple  viewpoints. 

•  Paper  published  in  IEEE  Conference  on  Computer  Vision  and  Pattern  Recognition,  2009. 

Developed  algorithm  for  object  identification  in  a  camera  network. 

•  We  apply  a  graphical  model  to  the  camera  network  to  allow  information  to  be  integrated  across  space  and  time. 

•  We  developed  an  inference  algorithm  that  can  combine  low-level  image  matching  and  user  input  to  find  video  frames 
likely  to  satisfy  a  query. 

• We perform active inference to determine which frames to ask a human operator to label. We have adapted the
Reflect and Correct algorithm so that it can perform this function in camera networks. We show that this algorithm
outperforms other approaches.

•  Paper  published  in  the  IEEE  Workshop  on  Person-Oriented  Vision,  2011. 

Developed  method  for  improving  object  detection  by  reducing  the  complexity  of  latent  variable  models  using  group  norm 
regularization.  Applied  to  deformable  parts  models  for  detection  of  people  and  vehicles. 

• Paper submitted to the International Conference on Computer Vision, 2013.

Three  students  completed  their  PhDs  at  UMD  and  were  supported  by  this  project: 

• Mustafa Bilgic, now assistant professor at IIT

•  Daozheng  Chen,  now  research  engineer  at  Yahoo! 

•  Galileo  Namata,  now  research  scientist  at  Verisign 


SRL tutorials given:

• Invited Tutorial, Conference on Neural Information Processing Systems (NIPS), December 2012.

• “Learning Statistical Models from Relational Data,” ACM International Conference on Management of Data
(SIGMOD), Athens, GR, June 2011.

• “Exploiting Statistical & Relational Information on the Web and in Social Media,” Eleventh SIAM International
Conference on Data Mining (SDM), Phoenix, AZ, April 2011.

• “Exploiting Statistical & Relational Information on the Web and in Social Media,” Fourth ACM International
Conference on Web Search and Data Mining (WSDM), Hong Kong, China, February 2011.

• “Exploiting Statistical & Relational Information on the Web and in Social Media: Applications, Techniques, and New
Frontiers,” National Conference on Artificial Intelligence (AAAI), joint with Lily Mihalkova, Atlanta, GA, July 2010.

SRL survey article:

•  Lifted  Graphical  Models:  A  Survey,  Lilyana  Mihalkova  and  Lise  Getoor,  Machine  Learning  Journal,  30  pages, 
accepted  subject  to  minor  revisions. 

Collaborations  and  Leveraged  Funding 

•  Lily,  NSF  funding. 

•  Object  recognition  work  in  collaboration  with  researchers  at  the  University  of  Chicago  and  the  Weizmann  Institute. 

•  Object  recognition  in  collaboration  with  researchers  at  MIT,  TTI  and  Virginia  Tech. 

Conclusions 

We have demonstrated that inference in graphical models can be used to direct resources to process the most useful pieces of data.
We have developed state-of-the-art algorithms for active inference and active learning in network data and have applied them to
camera network data.


